Millennial Migration Patterns

Authors

Jack Troxel, Catherine Erickson, Karrine Denisova, Ally Bardas

Published

December 5, 2022

abstract: The three questions in this analysis aimed to answer an overarching question: What are the moving patterns of young adults? The three questions included: What are the most and least mobile cities in the United States? What are the most popular cities to move to by number and ratio of people moving in? What are the movement trends across population sizes and regions? In cities with an upward trend in young adult migration, city planners should build businesses attractive to young adults and the government should allot more funding for public transportation and non-family housing. Additionally, companies looking for recent college graduates should build branches in these cities and target recruiting in the most mobile cities.

1 Background / Motivation

We were motivated to work on this problem because, as college students, we will soon be moving away from home to make our own lives. To help understand where other young adults are moving, we want to analyze their moving patterns.

2 Problem statement

We aim to answer the overarching question: What are the moving patterns of young adults? We attempt to answer this by focusing on three specific topics.

First, we determine which cities in the United States are the most mobile and which are the least mobile; in other words, which cities young adults tend to move out of and which cities they tend to stay in. Next, we examine which cities are most popular to move into and where young adults are moving from to get to these cities. We analyze the most popular cities using two different measures: the ratio of people moving there to the destination population, and the number of people moving there. Third, we consider movement trends based specifically upon region and population size.

3 Data sources

The data comes from the website “Migration Patterns,” which compiled its data from the US Census. The origin commuter zone was taken from the place of residence of a 16-year-old citizen in 2000. The destination commuter zone was taken from the place of residence of the same citizen in 2010, by then a 26-year-old young adult. It is important to note that the variables n (number of people moving to the city of interest), n_tot_o (origin population), and n_tot_d (destination population) count only young adults, not the total population of the city.
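To make these variable definitions concrete, here is a minimal sketch on a made-up four-row table. It assumes that pr_d_o equals n divided by n_tot_o and that pr_o_d equals n divided by n_tot_d; the `toy` frame and all of its numbers are illustrative and are not taken from the data set.

```python
import pandas as pd

# Toy origin-destination rows (made-up numbers, not from od_pooled.csv)
toy = pd.DataFrame({
    'o_cz': [1, 1, 2, 2],
    'd_cz': [1, 2, 1, 2],
    'n': [60, 40, 50, 150],            # grew up in o_cz, live in d_cz at age 26
    'n_tot_o': [100, 100, 200, 200],   # all young adults who grew up in o_cz
    'n_tot_d': [110, 190, 110, 190],   # all young adults living in d_cz at age 26
})

# Assumed relationships: share of the origin cohort and share of the destination cohort
toy['pr_d_o'] = toy['n'] / toy['n_tot_o']
toy['pr_o_d'] = toy['n'] / toy['n_tot_d']
print(toy)
```

Under this assumption, the pr_d_o values for a single origin sum to 1 across all of its destinations, including the row where the origin and destination are the same.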

4 Stakeholders

The main stakeholders are the government and policymakers. This analysis will help these stakeholders decide which infrastructure and programs to develop in particular communities, either to support an influx of young adults or to support those who decide to stay in their hometown. These resources will be influenced by the income and race of young adults. By providing information about the migration patterns of young adults, we hope to help stakeholders allot resources efficiently, with limited over- or under-spending. Other stakeholders include companies seeking employees and the real estate industry. This information will help companies identify where branches are likely to succeed if they are seeking young or diverse employees. Knowing which towns young adults are moving into or out of, and at what income level, will help the real estate industry determine what housing options those towns need.

5 Data quality check / cleaning / preparation

To prepare the data for analysis, we needed to separate the original data set into two separate smaller data sets: one in which the origin and destination city were the same for each observation, and the other in which the origin and destination city were different for each observation. The summary statistics for each of these data sets are shown below.

Distribution of variables in complete data set:

Code
import pandas as pd
data = pd.read_csv('/Users/jtroxel/Desktop/STAT303/project/MigrationPatternsData/od_pooled.csv')
data.loc[:,['n','n_tot_o','n_tot_d','pr_d_o','pr_o_d']].describe()
                   n        n_tot_o        n_tot_d         pr_d_o         pr_o_d
count   5.490810e+05   5.490810e+05   5.490810e+05  549081.000000  549081.000000
mean    5.722583e+01   4.242170e+04   4.242170e+04       0.001350       0.001350
std     3.263886e+03   1.098934e+05   1.116513e+05       0.021159       0.023555
min    -2.000000e+00   1.210000e+02   1.220000e+02      -0.010753      -0.008197
25%     0.000000e+00   4.460000e+03   3.611000e+03       0.000000       0.000000
50%     0.000000e+00   1.189300e+04   1.043900e+04       0.000000       0.000000
75%     3.000000e+00   3.383400e+04   3.263000e+04       0.000174       0.000181
max     1.400000e+06   1.755308e+06   1.713298e+06       0.804908       0.896690

Distribution of variables in data set with only same origin and destination:

Code
same_city = data.loc[data.o_cz==data.d_cz,:]
same_city.loc[:,['n','n_tot_o','n_tot_d','pr_d_o','pr_o_d']].describe()
                  n       n_tot_o       n_tot_d      pr_d_o      pr_o_d
count  7.410000e+02  7.410000e+02  7.410000e+02  741.000000  741.000000
mean   2.910734e+04  4.242170e+04  4.242170e+04    0.547464    0.624311
std    8.382501e+04  1.099675e+05  1.117266e+05    0.123248    0.117801
min    3.200000e+01  1.210000e+02  1.220000e+02    0.212815    0.147651
25%    2.123000e+03  4.460000e+03  3.611000e+03    0.462672    0.559933
50%    6.732000e+03  1.189300e+04  1.043900e+04    0.564747    0.634218
75%    2.135700e+04  3.383400e+04  3.263000e+04    0.641874    0.711286
max    1.400000e+06  1.755308e+06  1.713298e+06    0.804908    0.896690

Distribution of variables in data set with only differing origin and destination:

Code
diff_city = data.loc[data.o_cz!=data.d_cz,:]
diff_city.loc[:,['n','n_tot_o','n_tot_d','pr_d_o','pr_o_d']].describe()
                   n       n_tot_o       n_tot_d         pr_d_o         pr_o_d
count  548340.000000  5.483400e+05  5.483400e+05  548340.000000  548340.000000
mean       17.968917  4.242170e+04  4.242170e+04       0.000612       0.000508
std       206.726670  1.098934e+05  1.116513e+05       0.004922       0.003420
min        -2.000000  1.210000e+02  1.220000e+02      -0.010753      -0.008197
25%         0.000000  4.460000e+03  3.611000e+03       0.000000       0.000000
50%         0.000000  1.189300e+04  1.043900e+04       0.000000       0.000000
75%         3.000000  3.383400e+04  3.263000e+04       0.000173       0.000179
max     35687.000000  1.755308e+06  1.713298e+06       0.374172       0.235060

6 Exploratory Data Analysis

6.1 Analysis 1

First, we wanted to determine which cities in the United States are the least mobile and which cities are the most mobile. In other words, we wanted to find out which cities millennials chose to leave frequently and which cities millennials chose to stay in. To accomplish this, we first filtered the complete data set to include only observations of millennial movement from one city to that same city. For example, the new filtered data set would include an observation of “migration” from Chicago, IL to Chicago, IL, but would not include an observation of migration from Chicago, IL to Houston, TX.

By using this method to filter the data set, we were able to easily analyze the “pr_d_o” variable, the probability that a given individual lives in destination city “d” given that they grew up in origin city “o”. This statistic would allow us to rank United States cities by the proportion of millennials that chose to stay in each city, thus leaving us with a strong measure of “mobility”. At first, we did not use the “pr_d_o” variable to rank mobility levels of cities, but rather the “n” variable which offers a measure of the total number of millennials from origin “o” living in destination “d”. Using this variable, however, led to relatively useless results as cities with the largest overall populations were given an unfair weight in the rankings. Based on this, we determined that using the “pr_d_o” variable to rank cities would lead to a more useful analysis.
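The difference between the two rankings can be illustrated with a minimal sketch. The three places and their counts below are hypothetical and chosen only to show the effect; `stayers` is not a name from our analysis.

```python
import pandas as pd

# Made-up "stayer" rows (origin == destination) for three hypothetical places
stayers = pd.DataFrame({
    'o_cz_name': ['Big City', 'Mid Town', 'Small Town'],
    'n': [500_000, 9_000, 400],           # number who stayed
    'n_tot_o': [1_000_000, 12_000, 500],  # number who grew up there
})
stayers['pr_d_o'] = stayers['n'] / stayers['n_tot_o']

# Ranking by raw count simply reproduces the population ordering
rank_by_n = stayers.sort_values('n', ascending=False)['o_cz_name'].tolist()
# Ranking by proportion puts places of different sizes on the same scale
rank_by_share = stayers.sort_values('pr_d_o', ascending=False)['o_cz_name'].tolist()

print(rank_by_n)
print(rank_by_share)
```

In this toy case the largest place tops the ranking by count even though it retains the smallest share of its young adults, which is the distortion described above.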

Even if we isolate the search to a specific region, we find widespread variation. Taking a closer look below at just the midwestern cities from the data set, we can see that some cities in this region retain upwards of 70% of their millennial residents, whereas others retain fewer than 30%. Clearly, there are drastic differences in mobility between cities, suggesting that the insight gained from learning about the most and least mobile cities could be useful to stakeholders.

Code
states_to_regions = {
    'Washington': 'West', 'Oregon': 'West', 'California': 'West', 'Nevada': 'West',
    'Idaho': 'West', 'Montana': 'West', 'Wyoming': 'West', 'Utah': 'West',
    'Colorado': 'West', 'Alaska': 'West', 'Hawaii': 'West', 'Maine': 'Northeast',
    'Vermont': 'Northeast', 'New York': 'Northeast', 'New Hampshire': 'Northeast',
    'Massachusetts': 'Northeast', 'Rhode Island': 'Northeast', 'Connecticut': 'Northeast',
    'New Jersey': 'Northeast', 'Pennsylvania': 'Northeast', 'North Dakota': 'Midwest',
    'South Dakota': 'Midwest', 'Nebraska': 'Midwest', 'Kansas': 'Midwest',
    'Minnesota': 'Midwest', 'Iowa': 'Midwest', 'Missouri': 'Midwest', 'Wisconsin': 'Midwest',
    'Illinois': 'Midwest', 'Michigan': 'Midwest', 'Indiana': 'Midwest', 'Ohio': 'Midwest',
    'West Virginia': 'South', 'District of Columbia': 'South', 'Maryland': 'South',
    'Virginia': 'South', 'Kentucky': 'South', 'Tennessee': 'South', 'North Carolina': 'South',
    'Mississippi': 'South', 'Arkansas': 'South', 'Louisiana': 'South', 'Alabama': 'South',
    'Georgia': 'South', 'South Carolina': 'South', 'Florida': 'South', 'Delaware': 'South',
    'Arizona': 'Southwest', 'New Mexico': 'Southwest', 'Oklahoma': 'Southwest',
    'Texas': 'Southwest'}
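import seaborn as sns
# Copy first so the new column is not written to a slice of `data`
same_city = same_city.copy()
same_city['o_region'] = same_city['o_state_name'].map(states_to_regions)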
same_city_midwest = same_city.loc[same_city['o_region'] == 'Midwest']
a = sns.barplot(data = same_city_midwest, y = 'pr_d_o', x = 'o_cz_name', ci = None)
a.set(xticklabels=[])
a.set_ylabel('Proportion Stayed', size = 20)
a.set_xlabel('Midwest City', size = 20)
a.figure.set_figwidth(20)
a.figure.set_figheight(4)

Now, with the understanding that there is clear variation in mobility levels, we took a closer look at the five most mobile and five least mobile cities. These cities can be seen below:

Top 5 Most Mobile Cities are: Harlowton, Montana; Condon, Oregon; Oshkosh, Nebraska; Van Horn, Texas; and Loa, Utah

Code
most = same_city.sort_values(by = 'pr_d_o')
most.rename(columns = {'o_cz_name': 'City', 'o_state_name': 'State', 'pr_d_o': 'Proportion Stayed'}, inplace = True)

Top 5 Least Mobile Cities are: Los Angeles, California; New York, New York; Lafayette, Louisiana; Baton Rouge, Louisiana; Louisville, Kentucky

Code
least = same_city.sort_values(by = 'pr_d_o', ascending = False)
least.rename(columns = {'o_cz_name': 'City', 'o_state_name': 'State', 'pr_d_o': 'Proportion Stayed'}, inplace = True)

As expected, the population sizes of the most and least mobile cities differ sharply. The most mobile cities are extremely small, while the least mobile cities are highly populated, well-known cities. The average population size of the top-5 most mobile cities is only 370, whereas the average population size of the top-5 least mobile cities is 612,214.
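A comparison of average population sizes like this one can be sketched as follows. The six rows are made up, and `k` is set to 2 rather than 5 to keep the example small; the frame name `toy_same_city` is hypothetical and only mirrors the structure of the stayer data set.

```python
import pandas as pd

# Made-up stayer rows: share who stayed and young-adult population of each origin
toy_same_city = pd.DataFrame({
    'o_cz_name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'pr_d_o': [0.25, 0.30, 0.55, 0.60, 0.78, 0.80],
    'n_tot_o': [300, 450, 20_000, 35_000, 900_000, 1_200_000],
})
k = 2  # the report uses 5; 2 keeps the toy example small

# Lowest share staying = most mobile; highest share staying = least mobile
most_mobile = toy_same_city.nsmallest(k, 'pr_d_o')
least_mobile = toy_same_city.nlargest(k, 'pr_d_o')

avg_pop_most_mobile = most_mobile['n_tot_o'].mean()
avg_pop_least_mobile = least_mobile['n_tot_o'].mean()
print(avg_pop_most_mobile, avg_pop_least_mobile)
```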

Taking a closer look at both the least and most mobile city, we can see a clear difference in the destination choices of millennials from these two cities. Looking at the charts below, we see that a large proportion of the residents from Los Angeles chose to stay and the proportion of millennials from Los Angeles that ended up in different destinations seems almost insignificant. Looking at the most mobile city (Harlowton, MT), on the other hand, we see a much smaller frequency of millennials staying in town and a general trend of movement to other small cities in Montana. This observation is interesting, but by no means unexpected.

Code
city_state_mobile = same_city.iloc[same_city.pr_d_o.argmin()][['o_cz_name','o_state_name','o_cz']]
city_state_not_mobile = same_city.iloc[same_city.pr_d_o.argmax()][['o_cz_name','o_state_name','o_cz']]
data_mobile = data[data['o_cz'] == city_state_mobile.o_cz]
data_not_mobile = data[data['o_cz'] == city_state_not_mobile.o_cz]
top5_Harlowton = data_mobile.sort_values(by ='pr_d_o',ascending = False)[0:6]
top5_LA = data_not_mobile.sort_values(by ='pr_d_o',ascending = False)[0:6]
sum_prdo = top5_Harlowton.pr_d_o.sum()
top5_Harlowton = top5_Harlowton.reset_index()[['o_cz_name','d_cz_name','pr_d_o']]
top5_Harlowton.loc[len(top5_Harlowton.index)] = ['Harlowton', 'Others', 1-sum_prdo] 
ax=top5_Harlowton.plot(xlabel = 'Destination',ylabel= '% of Movement To Destination',kind='bar',stacked=True, color=['red', 'skyblue', 'orange','yellow','purple','green','brown'],x='d_cz_name', y = 'pr_d_o')
ax.figure.set_figwidth(2)
ax.figure.set_figheight(2)
ax.set(ylim=(0, 0.85))

Code
sum_prdo = top5_LA.pr_d_o.sum()
top5_LA = top5_LA.reset_index()[['o_cz_name','d_cz_name','pr_d_o']]
top5_LA.loc[len(top5_LA.index)] = ['Los Angeles', 'Others', 1-sum_prdo] 
a=top5_LA.plot(xlabel = 'Destination',ylabel= '% of Movement To Destination',kind='bar',stacked=True, color=['red', 'skyblue', 'orange','yellow','purple','green','brown'],x='d_cz_name', y = 'pr_d_o')
a.figure.set_figwidth(2)
a.figure.set_figheight(2)
a.set(ylim=(0, 0.85))

6.2 Analysis 2

This section explores which cities are the most popular for young adults to move to. We measured popularity in two ways: the top ten cities by the ratio of people moving there to people residing there, and the top ten cities by the number of people moving there.

Code
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
## Cleaning Data: remove rows where the young adults stayed in their origin (did not move)
pooled = pd.read_csv('/Users/catherineerickson/Desktop/STAT303/project/MigrationPatternsData/od_pooled.csv')
pooled=pooled.rename(columns={'n_tot_o':'Origin_Population',
                            'n_tot_d':'Destination_Population'})
pooled_filtered = pooled[pooled['o_cz'] != pooled['d_cz']]
##The most popular places to move based on ratio of number of people who moved there from out of town : population of town
total_n=pooled_filtered[['d_cz', 'n']].groupby('d_cz').sum()
total_n=total_n.reset_index()
populations=pooled_filtered[['d_cz', 'Destination_Population']].groupby('d_cz').mean()
populations=populations.reset_index()
n_and_pop=total_n.merge(populations, on='d_cz')
n_and_pop['Ratio']=n_and_pop.n/n_and_pop.Destination_Population
sorted_ratio=n_and_pop.sort_values(by=['Ratio'], ascending=False)
cities_and_states=pooled_filtered[['d_cz', 'd_cz_name', 'd_state_name']]
cities_and_states=cities_and_states.drop_duplicates()
sorted_by_ratio=sorted_ratio.merge(cities_and_states, on='d_cz')
## Visualizing the number of people moving to the top ten most popular cities by ratio by proportion
# Copy the top ten rows so the new columns below are not assigned to a slice
ratio=sorted_by_ratio.head(10).copy()
ratio['d_city_state']=ratio['d_cz_name']+', '+ratio['d_state_name']
ratio['in_town']=ratio['Destination_Population']-ratio['n']
ratio['n_percent']=ratio['n']/ratio['Destination_Population']
ratio['in_town_percent']=1-ratio['n_percent']
percent=ratio[['d_city_state', 'n_percent', 'in_town_percent']]
percent=percent.set_index('d_city_state')
ratio_percent=percent.plot.bar(stacked=True, ylabel='Proportion', xlabel='City', figsize=(3, 3))
ratio_percent.tick_params(axis='both', labelsize=10)
ratio_percent.legend(['Population from Out-of-Town', 'Population from In-Town'], bbox_to_anchor=(0.08,1), prop={'size':8})
pd.options.mode.chained_assignment = None

This plot shows the ten most popular towns to move to by ratio, that is, the towns with the highest ratio of people moving in to town population. Initially, we also included a graph comparing the out-of-town and in-town populations for the top ten towns. However, because the populations of these towns differ drastically, and all are on the lower side, that graph did not provide much visual insight (for example, Bristol Bay, the most popular town by ratio, has 300 young adults, whereas the town with the greatest population, Manhattan, KS, has 25,000 young adults).

To find the top ten towns by ratio out of the entire data set, we summed the “n” values across all rows for a destination to find the total number of people moving to that destination. We then divided this total by the destination population and sorted the ratios from greatest to smallest. Finally, we took the ten greatest ratios. To create the stacked bar plot, we made a dataframe of the top ten most popular towns by ratio and converted the out-of-town and in-town populations to proportions.

We hypothesize that the majority of people moving to these towns are moving from other small towns in-state for work or labor purposes. Finding the top ten towns with the highest ratio of people from out-of-town helps determine which towns may not be prepared for such an influx of population. Since these are smaller towns, they may need extra support from the government to accommodate all of these new young adults. However, since the actual number of people moving to these locations is low, we do not advise stakeholders to allot a great deal of resources here.

Code
## Most popular places to move based on number of people moving there
sorted_n=total_n.sort_values(by=['n'], ascending=False)
sorted_by_n=sorted_n.merge(cities_and_states, on='d_cz')
## Visualizing the number of people moving to the top ten most popular cities by number
top_ten_by_n=sorted_by_n.head(10)
pop_only=pooled_filtered[['d_cz', 'Destination_Population']]
d=pop_only.drop_duplicates(subset=['d_cz'])
d_pop=top_ten_by_n.merge(d, on='d_cz')
d_pop['In_town']=d_pop['Destination_Population']-d_pop['n']
in_town=d_pop[['d_cz_name', 'n', 'In_town']]
in_town=in_town.set_index('d_cz_name')
by_pop_plot=in_town.plot.bar(stacked=True, ylabel='Population', xlabel='City', figsize=(3, 3))
by_pop_plot.tick_params(labelsize=10)
by_pop_plot.yaxis.set_major_formatter('{x:,.0f}')
by_pop_plot.legend(['Population from Out-of-Town', 'Population from In-Town'], loc='upper left', prop={'size':8})
pd.options.mode.chained_assignment = None

Code
## Visualizing the number of people moving to the top ten most popular cities by number as a proportion
d_pop['n_percent']=d_pop['n']/d_pop['Destination_Population']
d_pop['in_town_percent']=1-d_pop['n_percent']
percentage=d_pop[['d_cz_name', 'n_percent', 'in_town_percent']]
percentage=percentage.set_index('d_cz_name')
by_pop_plot_proportion=percentage.plot.bar(stacked=True, ylabel='Proportion', xlabel='City', figsize=(2.5, 2.5))
by_pop_plot_proportion.tick_params(axis='both', labelsize=10)
by_pop_plot_proportion.legend(['Population from Out-of-Town', 'Population from In-Town'], bbox_to_anchor=(0.9,1.2), prop={'size':8})

These two plots show the top ten cities to move to by the raw number of people moving there. In contrast to the towns in the ratio plot, the top ten cities by number are all fairly large. However, they are not simply the ten largest cities in the US. For example, Phoenix, Philadelphia, and San Antonio, the fifth through seventh largest cities, do not make the list of most popular cities. We hypothesize that the cities that make the list instead, such as DC, Atlanta, Seattle, Denver, and San Francisco, are more popular because of the industries they attract (such as tech in San Francisco) and the nightlife and social community they provide (such as the outdoor community in Denver). Young adults may prioritize these factors over the family environment that is found in San Antonio.

The first barplot is organized in descending order of most popular to least popular. However, since these cities do have fairly different populations, the second barplot shows the cities that are most popular to move to by proportion. New York City is the most popular by number and Denver is the most popular by proportion.

We found the top ten cities to move to by number by summing the people moving to each destination and then ranking the destinations from greatest to least. We created the plots in the same way as the plot for ratio. For the first plot, we plotted the number of people from out-of-town against the number of people from in-town. For the second plot, we converted these numbers to proportions. Not only do these plots show the top ten cities to move to by number, the proportion graph also gives insight into which cities are growing in popularity. Since Denver has the highest proportion, its population has not caught up to the number of people moving into the city in the way that the population of Los Angeles has. Therefore, we may expect to see a high proportion of people moving to Denver in the future as well, until the trend levels off as a larger population outweighs the number of people moving in.

One problem we encountered with the analyses for both popularity by ratio and popularity by number was cleaning the dataset to remove the row that accounts for people staying in the city they grew up in. Not removing this row made the total “n” for every city equal to the population of the city, causing the most popular cities to simply be in order of population. This inspired our analysis for popularity by ratio because we wondered how population size was influencing the results. Although population does have an influence, the top ten towns by ratio are not simply the ten smallest towns and the top ten cities by number are not simply the ten largest cities, which shows that factors other than population influence the popularity of cities.
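The effect of the stayer row can be seen on a minimal sketch with made-up flows between two hypothetical commuting zones (the `flows` frame and its numbers are illustrative, not from the data set):

```python
import pandas as pd

# Made-up flows between two hypothetical commuting zones
flows = pd.DataFrame({
    'o_cz': [1, 1, 2, 2],
    'd_cz': [1, 2, 1, 2],
    'n': [60, 40, 50, 150],
    'n_tot_d': [110, 190, 110, 190],  # young adults living in each destination
})

# Summing n with the stayer rows included just returns the destination population
with_stayers = flows.groupby('d_cz')['n'].sum()
dest_pop = flows.groupby('d_cz')['n_tot_d'].first()

# Dropping rows where origin == destination leaves only people who moved in
movers_only = flows[flows['o_cz'] != flows['d_cz']].groupby('d_cz')['n'].sum()

print(with_stayers.tolist(), dest_pop.tolist(), movers_only.tolist())
```

With the stayer rows included, zone 2 looks more popular only because it is larger; after filtering, zone 1 receives more movers.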

6.3 Analysis 3

Next, we explored movement trends across population sizes and regions. The United States Census Bureau identifies two types of urban areas: Urbanized Areas (UAs) of 50,000 or more people, and Urban Clusters (UCs) of at least 2,500 and fewer than 50,000 people. Based on these definitions, we grouped town and city populations into three groups: one of population sizes ranging from 120 to 2,500 people, one ranging from 2,500 to 50,000 people, and the last ranging from 50,000 to 2,000,000 people. We then stored these groups as ‘n_tot_o_cat’ and ‘n_tot_d_cat,’ the origin population size group and the destination population size group, respectively.
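How values on the group boundaries are assigned can be checked with a short sketch on hand-picked populations. The labels 'rural', 'urban cluster', and 'urbanized area' are hypothetical names added for readability and are not used in our analysis, which keeps the default interval labels.

```python
import pandas as pd

# Hand-picked populations that sit on or next to the bin edges
pops = pd.Series([121, 2_499, 2_500, 49_999, 50_000, 1_755_308])

# right=False closes each bin on the left: [120, 2500), [2500, 50000), [50000, 2000000)
size_cat = pd.cut(
    pops,
    bins=[120, 2_500, 50_000, 2_000_000],
    right=False,
    labels=['rural', 'urban cluster', 'urbanized area'],  # hypothetical labels
)
print(size_cat.tolist())
```

With `right=False`, a population of exactly 2,500 falls in the middle group and a population of exactly 50,000 falls in the largest group, matching the Census Bureau thresholds quoted above.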

Code
import pyalluvial.alluvial as alluvial
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
data = pd.read_csv("od_pooled.csv")
data['n_tot_o_cat']=pd.cut(data.n_tot_o,bins=[120,2500,50000,2000000],right=False)
data['n_tot_d_cat']=pd.cut(data.n_tot_d,bins=[120,2500,50000,2000000],right=False)
wide_pop = pd.DataFrame(data.groupby(['n_tot_o_cat','n_tot_d_cat'])['n'].sum()/data.groupby(['n_tot_o_cat'])['n'].sum())
wide_pop= wide_pop.reset_index()
fig = alluvial.plot(df=wide_pop, xaxis_names=['n_tot_o_cat', 'n_tot_d_cat'], y_name='n', alluvium="n_tot_o_cat")
fig.set_figwidth(6)
fig.set_figheight(5)

From there, using an alluvial plot [1], we visualized the proportion of people moving to and from towns and cities of different sizes. For example, people from cities in the largest population category (50,000 to 2,000,000 people) have a consistent movement pattern, almost always relocating to similarly sized cities and rarely moving to smaller towns. Moving to a town or city of a different size is more common for people who come from the other two population ranges. It is not uncommon for individuals from the 120 to 2,500 range to relocate to a larger town or big city, as shown by the thickness of the lines crossing the graph.

Code
states_to_regions = {
    'Washington': 'West', 'Oregon': 'West', 'California': 'West', 'Nevada': 'West',
    'Idaho': 'West', 'Montana': 'West', 'Wyoming': 'West', 'Utah': 'West',
    'Colorado': 'West', 'Alaska': 'West', 'Hawaii': 'West', 'Maine': 'Northeast',
    'Vermont': 'Northeast', 'New York': 'Northeast', 'New Hampshire': 'Northeast',
    'Massachusetts': 'Northeast', 'Rhode Island': 'Northeast', 'Connecticut': 'Northeast',
    'New Jersey': 'Northeast', 'Pennsylvania': 'Northeast', 'North Dakota': 'Midwest',
    'South Dakota': 'Midwest', 'Nebraska': 'Midwest', 'Kansas': 'Midwest',
    'Minnesota': 'Midwest', 'Iowa': 'Midwest', 'Missouri': 'Midwest', 'Wisconsin': 'Midwest',
    'Illinois': 'Midwest', 'Michigan': 'Midwest', 'Indiana': 'Midwest', 'Ohio': 'Midwest',
    'West Virginia': 'South', 'District of Columbia': 'South', 'Maryland': 'South',
    'Virginia': 'South', 'Kentucky': 'South', 'Tennessee': 'South', 'North Carolina': 'South',
    'Mississippi': 'South', 'Arkansas': 'South', 'Louisiana': 'South', 'Alabama': 'South',
    'Georgia': 'South', 'South Carolina': 'South', 'Florida': 'South', 'Delaware': 'South',
    'Arizona': 'Southwest', 'New Mexico': 'Southwest', 'Oklahoma': 'Southwest',
    'Texas': 'Southwest'}
data['o_region'] = data['o_state_name'].map(states_to_regions)
data['d_region'] = data['d_state_name'].map(states_to_regions)
diff_city = data[data['o_cz'] != data['d_cz']]
#To get the pr_d_o for each region, this can be calculated by total number of individual who live in D region from O region/Total number of individuals from O region.
wide_region_pr_d_o = pd.DataFrame(diff_city.groupby(['o_region','d_region'])['n'].sum()/diff_city.groupby(['o_region'])['n'].sum())
wide_region_pr_d_o = wide_region_pr_d_o.reset_index()
fig = alluvial.plot(df=wide_region_pr_d_o, xaxis_names=['o_region','d_region'], y_name='n', alluvium="o_region")
fig.set_figwidth(5)
fig.set_figheight(4)

Further, we assigned each state to its respective region and created ‘o_region,’ the origin region, and ‘d_region,’ the destination region. To keep only those who moved to a new location, we filtered the data to rows where the origin and destination commuting zones differ (o_cz is not equal to d_cz), so moves within a region are still counted. Then, we calculated the proportion relocating to each region by dividing the number of individuals from an origin region who live in a destination region by the total number of individuals who moved from that origin region. The alluvial plot shows that the thickest line from each origin region leads to the same region (West to West, Southwest to Southwest, etc.): people tend to relocate within the region they are originally from.
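The proportion calculation can also be written with a grouped transform, shown here as an alternative sketch to the index-aligned division in our code above. The `moves` frame and its counts are made up for illustration.

```python
import pandas as pd

# Made-up mover counts between two regions (rows already exclude stayers)
moves = pd.DataFrame({
    'o_region': ['West', 'West', 'South', 'South'],
    'd_region': ['West', 'South', 'South', 'West'],
    'n': [300, 100, 450, 50],
})

# Total movers for each origin-destination pair of regions
flows = moves.groupby(['o_region', 'd_region'], as_index=False)['n'].sum()

# Divide each pair by the total number of movers from its origin region
flows['share'] = flows['n'] / flows.groupby('o_region')['n'].transform('sum')
print(flows)
```

Each origin region's shares sum to 1, so the values can be compared across regions regardless of how many people moved in total.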

Code
pivoted = wide_region_pr_d_o.pivot(index='o_region', columns='d_region', values='n')
ax = pivoted.plot(kind='bar', stacked=True, figsize=(3, 3))
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

It is also interesting to observe a stacked bar plot in order to see what the second most common relocation is. For example, we can observe that the second most common region to relocate to from the West is the Southwest, and the second most common region to relocate to from the Northeast is the South. This could perhaps be explained by proximity, as these destination regions are nearby to the region of origin. Additionally, the South has the largest proportion of those relocating to the same region, compared to the Northeast which has a higher proportion of those relocating to other regions, or the most movement out.

Using these methods, we can continue to observe movement trends to the 5 most populated American cities from each region. Given the information gathered before finding that people tend to move to larger cities and that they tend to move within the same region, our assumption was that people tend to move to the largest city within their region when relocating. For example, the most popular American destination to relocate to as someone from the Midwest would be Chicago. This relocation trend of those from a certain region relocating to the major city in that region agrees with our assumption, aside from the South. Before, we observed that those from the South mostly relocate to the South. However, looking at this plot it is interesting to note that the city people commonly relocate to from the South is not a major city in the South, but New York.

Code
top5 = data.groupby(['d_cz_name','d_state_name'])['n_tot_d'].mean().sort_values(ascending = False)[0:5]
df_top5 = pd.DataFrame(top5).reset_index()
cities = df_top5['d_cz_name'].to_list()
data_top_5 = data[data['d_cz_name'].isin(cities)]
wide_top5 = pd.DataFrame(data_top_5.groupby(['o_region','d_cz_name'])['n'].sum()/data_top_5.groupby(['o_region'])['n'].sum())
wide_top5 = wide_top5.reset_index()
fig = alluvial.plot(df=wide_top5, xaxis_names=['o_region','d_cz_name'], y_name='n', alluvium="o_region")
fig.set_figwidth(5)
fig.set_figheight(4)

Code
pivoted = wide_top5.pivot(index='o_region', columns='d_cz_name', values='n')
ax = pivoted.plot(kind='bar', stacked=True, figsize=(3, 3))
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))

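Based on this relocation data across regions and population groups, we have three main findings. First, origin population size is related to movement: young adults from the largest cities almost always relocate to similarly sized cities, while those from smaller towns more often move to places of a different size. Second, young adults tend to relocate within the region they are originally from. Lastly, among the five most populated cities, the most popular destination for each region is generally the major city within that region, with the South as the exception.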

7 Conclusions

The analyses of these three questions connect and lead to a bigger question. We use these smaller questions, which relate to the factors that contribute to geographical movement, the trends we can find, which destinations are most popular, distances travelled, who is most mobile, and race and income, to investigate this larger question. All these separate, individual analyses led us to explore overall millennial mobility trends across the United States. These results are helpful for different disciplines, such as sociology, political science, anthropology, geography, law, and economics. These results may inspire further studies on why people move and the consequences of these movements in a broader context, both for those who moved and for the societies impacted.

8 Recommendations to stakeholder(s)

Recommendation 1: City developers should focus on building businesses attractive to young adults in cities where young adults will be moving to in order to retain their popularity as migration destinations and increase economic success. Businesses attractive to young adults include bars, trendy and/or fast-casual restaurants, and co-working spaces. Specifically, developers in already popular cities such as Denver, Atlanta, and Seattle have an opportunity to profit from specific planning targeted towards young adults. Similarly, developers that wish to improve their rates of millennial inflow should consider the same practices to increase the appeal of their cities for young adults. For example, cities that are very populous but not very attractive to young adults such as Phoenix, Philadelphia, and San Antonio may see new economic growth from prioritizing certain types of business development. These businesses could include gyms, bars, inexpensive food/retail options, and other options that attract those in their mid-to-late 20s. By focusing on the development of businesses that are popular for young adults, cities are likely to see strong economic growth and a continuation or increase of young adult migration.

Recommendation 2: The government should increase funding in those popular cities where the proportion of residents who moved in from elsewhere is greatest. These cities would include Denver, DC, Seattle, and Atlanta. Since these four cities are among the most popular to move to by number and they have the highest proportion of people from out-of-town, they have the greatest influx of young adults. Therefore, these cities likely do not have the same level of infrastructure to support new residents as other popular cities with a smaller proportion of out-of-town residents, such as Chicago and Los Angeles. The rate of growth in Denver, DC, Seattle, and Atlanta is greater. The increased funding would be most beneficial if put towards public transportation and approving non-family housing, such as apartment complexes. These resources are especially beneficial for young adults who are not as financially stable as older residents. When young people are first entering the workforce, they are less likely to have access to a car for transportation or savings to purchase a home. Therefore, supporting these types of infrastructure in popular cities with large proportions of out-of-town residents would allow these cities to continue to grow.

Recommendation 3: Companies looking to hire recent graduates should place more offices in cities that are popular with young people. As noted above, Denver, DC, Seattle, and Atlanta would be strategic locations because a higher proportion of their young adults come from out of town. These companies, especially those in need of young labor, can also aim their advertising (billboards, recruiting events, etc.) at the most mobile towns, such as Harlowton, Montana. Offices in Philadelphia, New York, Los Angeles, Houston, and Chicago could also be beneficial, as large proportions of young adults from each region relocate to these populous cities. Companies can also use the data to gauge a location's desirability by the distance people are willing to move. For example, large numbers of the young adults who move to New York and DC come from out of state, whereas the main hotspot of origins for people moving to Los Angeles is California itself, as seen in the appendix.

One limitation of these results for stakeholders is that the data does not come from the most recent census. To provide more current results, we would repeat the analysis using the 2010 census for origins and the 2020 census for destinations. Analyzing these patterns over time would also help stakeholders see whether the results reflect a trend or were a fluke in a single period. Additionally, by analyzing over time, stakeholders may be able to see connections between actions taken during one decade and the moving patterns of young adults. For example, the government could examine whether differences in public school funding affect the mobility of young adults.

References

[1] “pyalluvial.” GitHub, https://github.com/nekoumei/pyalluvial (adapted; we changed the variable names and data)

[2] “Choropleth.” Choropleth Maps in Python, https://plotly.com/python/choropleth-maps/

[3] “A Python Dictionary to Translate US States to Two Letter Codes.” Gist, https://gist.github.com/rogerallen/1583593

Appendix

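import numpy as np
import pandas as pd

# Keep only the flows whose destination appears in top_ten_by_n.
# Columns present in both frames get _x/_y suffixes (e.g. n_x, d_state_name_x).
pooled_with_only_top_ten_d = pooled_filtered.merge(top_ten_by_n, on='d_cz_name')
pooled_with_only_top_ten_d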
abbrev = {'Alabama': 'AL',
        'Alaska': 'AK',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'District of Columbia': 'DC',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'}
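# Translate full state names to two-letter codes for plotting.
# Names that are not keys of abbrev become NaN.
pooled_with_only_top_ten_d['o_abbrev'] = pooled_with_only_top_ten_d['o_state_name'].map(abbrev)
pooled_with_only_top_ten_d['d_abbrev'] = pooled_with_only_top_ten_d['d_state_name_x'].map(abbrev)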

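import json
from urllib.request import urlopen

# Note: the URL below is a signed Kaggle download link issued on 2022-12-04 with
# X-Goog-Expires=259200 (three days), so it has expired. Substitute a current
# link or a local copy of us-states.json before running.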
with urlopen('https://storage.googleapis.com/kagglesdsdata/datasets/831691/1428241/us-states.json?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20221204%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221204T193358Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=0c24634e87bf688c99b0881eed9612627d6f37139581cad5415d4c4cb6cae33593f899f207dfa61db2770155c2fa6593bc3190efa3450f2ddfc0f1e1847173cd361183d34773bcce056195a2d0cf66c21a0d5fdf316c8f76e660af64a748c6d7d2566181f795015978bd4ae941e9a19e4689e2eb3222ebcabbd12fad0e14e5b245f02ef6d6f4924ca7178327aa91ceeeefc60add514ccf308e675ad6b421d09e03ec0b4aa57b68da23fb097777a100006c919165d5a7c7edaba8fc7d0f5127f975d925a0e09e128ed78230804657f291d6c061eebf868d5e2547082747feada6f98bf56582bd364f51a8f341c4bc8b1d5068ddebf63e3e3e2801b9c5389d8ab4') as response:
    states = json.load(response)

import plotly.express as px

# Total movers into New York, summed by origin state.
NY = pooled_with_only_top_ten_d[pooled_with_only_top_ten_d['d_cz_name'] == 'New York']
NY_new = pd.pivot_table(NY, index='o_abbrev', values='n_x', aggfunc='sum')
NY_new = NY_new.reset_index()
fig_NY = px.choropleth(NY_new, geojson=states,
                    locations='o_abbrev',
                    locationmode="USA-states",
                    scope="usa",
                    color='n_x',
                    color_continuous_scale="Viridis_r",
                    title='New York'
                    )

fig_NY.show()
# Use a distinct file name so later maps do not overwrite this one.
fig_NY.write_html("new_york_origins.html")
# Total movers into Los Angeles, summed by origin state.
LA = pooled_with_only_top_ten_d[pooled_with_only_top_ten_d['d_cz_name'] == 'Los Angeles']
LA_new = pd.pivot_table(LA, index='o_abbrev', values='n_x', aggfunc='sum')
LA_new = LA_new.reset_index()

fig_LA = px.choropleth(LA_new, geojson=states,
                    locations='o_abbrev',
                    color='n_x',
                    color_continuous_scale="Viridis_r",
                    scope='usa',
                    title='Los Angeles')

fig_LA.show()
# Use a distinct file name so this map does not overwrite the others.
fig_LA.write_html("los_angeles_origins.html")
# Total movers into Washington DC, summed by origin state.
DC = pooled_with_only_top_ten_d[pooled_with_only_top_ten_d['d_cz_name'] == 'Washington DC']
DC_new = pd.pivot_table(DC, index='o_abbrev', values='n_x', aggfunc='sum')
DC_new = DC_new.reset_index()
fig_DC = px.choropleth(DC_new, geojson=states,
                    locations='o_abbrev',
                    locationmode="USA-states",
                    scope="usa",
                    color='n_x',
                    color_continuous_scale="Viridis_r",
                    title='Washington DC'
                    )

fig_DC.show()

These three choropleth maps show which states the people who moved to New York, Los Angeles, or Washington DC came from. We referenced an online resource to help structure our code for the choropleth maps [2]. We changed the parameters in the source to match our dataframe, the US states, and the number of people moving from each origin state. For each choropleth map, we had to create a new dataframe that included only the destination of interest. One problem we ran into when creating these visualizations was that the original dataframe uses full state names rather than the two-letter state codes that plotly.express uses. To solve this, we mapped the state-name-to-code dictionary from [3] onto the dataframe.
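The per-destination preparation described above is repeated for each map, so it could be wrapped in a single helper. The following is only a sketch under stated assumptions: the function name origin_counts, the toy dataframe, and the shortened dictionary abbrev_small are hypothetical and the counts are made up. Only the column names (o_state_name, d_cz_name, n_x, o_abbrev) come from our dataframe. The helper also reports any origin names that are missing from the dictionary, since those rows would otherwise silently drop out of the map.

```python
import pandas as pd

# Hypothetical miniature of the merged dataframe; every count here is made up.
toy = pd.DataFrame({
    "o_state_name": ["California", "California", "New York", "Texas", "Puerto Rico"],
    "d_cz_name": ["Los Angeles", "Los Angeles", "Los Angeles", "New York", "New York"],
    "n_x": [120, 80, 30, 15, 5],
})

# Shortened stand-in for the full state-name-to-code dictionary.
abbrev_small = {"California": "CA", "New York": "NY", "Texas": "TX"}


def origin_counts(df, destination, abbrev):
    """Sum movers into `destination` by origin state code.

    Returns the aggregated dataframe and a sorted list of origin names
    that had no entry in `abbrev`.
    """
    subset = df[df["d_cz_name"] == destination].copy()
    subset["o_abbrev"] = subset["o_state_name"].map(abbrev)
    # Names missing from the dictionary map to NaN; collect them for review.
    missing = subset.loc[subset["o_abbrev"].isna(), "o_state_name"]
    unmapped = sorted(missing.unique())
    # groupby drops NaN keys by default, so unmapped rows are excluded here.
    counts = subset.groupby("o_abbrev", as_index=False)["n_x"].sum()
    return counts, unmapped


la_counts, la_unmapped = origin_counts(toy, "Los Angeles", abbrev_small)
ny_counts, ny_unmapped = origin_counts(toy, "New York", abbrev_small)
print(la_counts)
print("Unmapped origins for New York:", ny_unmapped)
```

The returned dataframe has the same shape as NY_new, LA_new, and DC_new, so it could be passed to px.choropleth in the same way.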

Of the top three most popular cities to move to, Los Angeles is the only one whose own state is also the most popular origin. This is likely because California is such a large state with several large cities. However, for all three cities mapped, the origins are most concentrated in the states around the city. Consistent hotspots also appear in states that have large cities, such as California, Texas, Florida, Illinois, and New York.
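The comparison between a destination's own state and its most popular origin state can also be checked in code rather than read off the maps. The sketch below is illustrative only: the destinations "City A" and "City B" and all counts are invented placeholders, not results from our data. It assumes a dataframe with the d_cz_name, d_abbrev, o_abbrev, and n_x columns used above.

```python
import pandas as pd

# Invented placeholder flows; these are not results from the migration data.
flows = pd.DataFrame({
    "d_cz_name": ["City A", "City A", "City A", "City B", "City B"],
    "d_abbrev": ["CA", "CA", "CA", "NY", "NY"],
    "o_abbrev": ["CA", "CA", "TX", "NJ", "NY"],
    "n_x": [120, 80, 40, 90, 60],
})

# Total movers for each (destination, origin state) pair.
totals = flows.groupby(
    ["d_cz_name", "d_abbrev", "o_abbrev"], as_index=False
)["n_x"].sum()

# For each destination, keep the row of the origin state with the largest total.
top = totals.loc[totals.groupby("d_cz_name")["n_x"].idxmax()].reset_index(drop=True)

# Flag destinations whose own state is also their most popular origin.
top["home_state_is_top"] = top["o_abbrev"] == top["d_abbrev"]
print(top[["d_cz_name", "o_abbrev", "n_x", "home_state_is_top"]])
```

Run on pooled_with_only_top_ten_d, the same steps would give one row per destination, which makes the comparison easy to extend beyond the three cities mapped here.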